Aki Shiroshita (Epidemiology PhD student, akihiro.shiroshita@vanderbilt.edu) developed a tailored version of DeGAUSS specifically for the EV project.
About Original DeGAUSS
DeGAUSS (https://degauss.org/) is designed to derive environmental variables while preserving the privacy of protected health information (PHI). It uses Docker images to process address data. Users upload a CSV file containing address information and receive an output file with various environmental variables.
Limitations of Original DeGAUSS
Not optimized for very large datasets.
Requires input and output files to be stored in the same folder.
Improvements in the Modified DeGAUSS
Avoid reliance on Docker as much as possible.
Utilize C and C++ in the backend wherever possible.
Enable parallel processing with multiple cores.
Restrict environmental data to Tennessee only, not the entire U.S.
Modified DeGAUSS provides clean, processed output files with all PHI removed.
How to Use Modified DeGAUSS
The environment has already been set up for you. All you need to do is follow the instructions.
Step-by-Step Instructions
- Locate the Folder:
Navigate to the folder “C:_degauss_2025_08_14” on the Windows server (Cqshealth.dhcp.mc.vanderbilt.edu).
- Open R Project:
Launch RStudio.
Note: It may take 1–2 minutes to open, as the RStudio settings have been customized for this project. Each time you start the program, please wait until items appear in the environment before continuing.
- Start Docker Desktop:
Open Docker Desktop for Windows.
- Run the Script:
Open the file test.R.
Execute the script section by section using the shortcut:
Place your cursor in the section and press Ctrl + Alt + T.
- Locate Output Files:
Processed data will be saved in any folder of your choice.
This folder contains CSV files, including: tract.csv (used for subject selection flow), final_data.csv (the final dataset for sharing with other researchers, with all PHI removed), tab_census.csv (census tract tabulation data), and tab_relocation.csv (relocation information).
Specific instructions for Huiping
- Map your shared drive containing address information to our Windows server
Note: Your data will remain on the shared drive and will never leave the VUMC environment. The server will load data into memory for processing, but data will not be stored in local server folders. Any temporary cache generated during processing will be automatically removed.
- Folder choice
Could you provide the path to the input folder containing the address data and the file name?
What is the path to the output folder where you’d like to store the processed data after removing all PHI data?
If you would like to create a temporary folder in a different location to store intermediate files containing PHI, please specify the path.
Defining start date and end date
To define the start date and end date, we need to merge any overlapping or adjacent enrollment periods into single, continuous time spans. This ensures there are no gaps in the timeline.
The TennCare enrollment file looks like this:
| recip | enrol_begin_date | enrol_end_date | address |
|---|---|---|---|
| 1 | 2023-01-01 | 2024-01-02 | 123 Main St |
| 1 | 2024-01-02 | 2025-03-02 | 456 Elm St |
| 2 | 2022-01-02 | 2023-01-02 | 789 Oak St |
not like this:
| recip | registration_date | address |
|---|---|---|
| 1 | 2023-01-01 | 123 Main St |
| 1 | 2024-01-02 | 456 Elm St |
| 2 | 2022-01-02 | 789 Oak St |
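The merging step described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the modified DeGAUSS. The function name `merge_enrollment_periods` and the parameter `max_gap_days` are hypothetical. It also assumes that "adjacent" means the next period begins no later than one day after the previous one ends, which the text above does not define. The address column is left out because only the dates matter for this step.

```python
from datetime import date, timedelta
from itertools import groupby
from operator import itemgetter


def merge_enrollment_periods(rows, max_gap_days=1):
    """Merge overlapping or adjacent periods for each recipient.

    rows: iterable of (recip, begin_date, end_date) tuples.
    max_gap_days: ASSUMPTION. Two periods count as adjacent when the next
    one begins at most this many days after the previous one ends.
    Returns a list of (recip, begin_date, end_date) tuples.
    """
    merged = []
    # Sort by recipient, then by begin date, so one pass per recipient suffices.
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    for recip, group in groupby(ordered, key=itemgetter(0)):
        current_begin = None
        current_end = None
        for _, begin, end in group:
            if current_begin is None:
                # First period for this recipient.
                current_begin, current_end = begin, end
            elif begin <= current_end + timedelta(days=max_gap_days):
                # Overlapping or adjacent: extend the current span.
                current_end = max(current_end, end)
            else:
                # A real gap: close the current span and start a new one.
                merged.append((recip, current_begin, current_end))
                current_begin, current_end = begin, end
        merged.append((recip, current_begin, current_end))
    return merged


# Illustrative rows mirroring the enrollment table layout (recip, begin, end).
enrollment = [
    (1, date(2023, 1, 1), date(2024, 1, 2)),
    (1, date(2024, 1, 2), date(2025, 3, 2)),
    (2, date(2022, 1, 2), date(2023, 1, 2)),
]

merged_spans = merge_enrollment_periods(enrollment)
for recip, begin, end in merged_spans:
    print(recip, begin.isoformat(), end.isoformat())
# prints:
# 1 2023-01-01 2025-03-02
# 2 2022-01-02 2023-01-02
```

Recipient 1 ends up with a single span because the second period begins on the day the first one ends. Setting `max_gap_days=0` would merge only periods that overlap or share a boundary date.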
Delete modified DeGAUSS
Once all processes are completed and the required outputs are finalized, I will delete the modified DeGAUSS from the server.